A Unifying Perspective of Parametric Policy Search Methods for Markov Decision Processes
Authors
Abstract
Parametric policy search algorithms are one of the methods of choice for the optimisation of Markov Decision Processes, with Expectation Maximisation and natural gradient ascent being popular methods in this field. In this article we provide a unifying perspective of these two algorithms by showing that their search-directions in the parameter space are closely related to the search-direction of an approximate Newton method. This analysis leads naturally to the consideration of this approximate Newton method as an alternative optimisation method for Markov Decision Processes. We are able to show that the algorithm has numerous desirable properties, absent in the naive application of Newton’s method, that make it a viable alternative to either Expectation Maximisation or natural gradient ascent. Empirical results suggest that the algorithm has excellent convergence and robustness properties, performing strongly in comparison to both Expectation Maximisation and natural gradient ascent.

1 Markov Decision Processes

Markov Decision Processes (MDPs) are the most commonly used model for the description of sequential decision making processes in a fully observable environment, see e.g. [5]. A MDP is described by the tuple {S, A, H, p_1, p, π, R}, where S and A are sets known respectively as the state and action space, H ∈ N is the planning horizon, which can be either finite or infinite, and {p_1, p, π, R} are functions referred to as the initial state distribution, transition dynamics, policy (or controller) and reward function, respectively. In general the state and action spaces can be arbitrary sets, but we restrict our attention to either discrete sets or subsets of R^n, where n ∈ N. We use boldface notation to represent a vector and also use the notation z = (s, a) to denote a state-action pair. Given a MDP, the trajectory of the agent is determined by the following recursive procedure: given the agent’s state, s_t, at a given time-point, t ∈ N_H, an action is selected according to the policy, a_t ∼ π(·|s_t); the agent then transitions to a new state according to the transition dynamics, s_{t+1} ∼ p(·|a_t, s_t); this process is iterated sequentially through all of the time-points in the planning horizon, where the state of the initial time-point is determined by the initial state distribution, s_1 ∼ p_1(·). At each time-point the agent receives a (scalar) reward that is determined by the reward function, which depends on the current action and state of the environment. Typically the reward function is assumed to be bounded, but as the objective is linear in the reward function we assume w.l.o.g. that it is non-negative.

The most widely used objective in the MDP framework is to maximise the total expected reward of the agent over the course of the planning horizon. This objective can take various forms, including an infinite planning horizon, with either discounted or average rewards, or a finite planning horizon. The theoretical contributions of this paper are applicable to all three frameworks, but for notational ease and for reasons of space we concern ourselves with the infinite-horizon framework with discounted rewards. In this framework the boundedness of the objective function is ensured by the discount factor, γ ∈ [0, 1).
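For concreteness, the discounted objective described in this section can be written as follows; the notation here, with a policy parameter vector w and discount factor γ ∈ [0, 1), is a standard assumption rather than something fixed by this excerpt:

\[
U(\mathbf{w}) \;=\; \mathbb{E}\Bigg[\sum_{t=1}^{\infty} \gamma^{t-1} R(s_t, a_t)\Bigg],
\qquad
s_1 \sim p_1(\cdot), \quad
a_t \sim \pi(\cdot \mid s_t; \mathbf{w}), \quad
s_{t+1} \sim p(\cdot \mid a_t, s_t).
\]

Because the reward function is bounded and non-negative, the geometric weights γ^{t-1} keep this sum finite, which is the boundedness property noted at the end of the paragraph above.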
Similar resources
Approximate Newton Methods for Policy Search in Markov Decision Processes
Approximate Newton methods are standard optimization tools which aim to maintain the benefits of Newton’s method, such as a fast rate of convergence, while alleviating its drawbacks, such as computationally expensive calculation or estimation of the inverse Hessian. In this work we investigate approximate Newton methods for policy optimization in Markov decision processes (MDPs). We first analy...
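For reference (this equation is not part of the cited abstract), the update that such methods build on is the Newton step on an objective U(w), where α_k is a step size; approximate Newton and Gauss-Newton methods replace the exact Hessian by a cheaper, more easily inverted surrogate Ĥ:

\[
\mathbf{w}_{k+1} \;=\; \mathbf{w}_k \;-\; \alpha_k\, \hat{H}(\mathbf{w}_k)^{-1}\, \nabla_{\mathbf{w}} U(\mathbf{w}_k),
\qquad
\hat{H}(\mathbf{w}) \approx \nabla^2_{\mathbf{w}} U(\mathbf{w}).
\]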
Non-parametric Policy Search with Limited Information Loss
Learning complex control policies from non-linear and redundant sensory input is an important challenge for reinforcement learning algorithms. Non-parametric methods that approximate value functions or transition models can address this problem, by adapting to the complexity of the data set. Yet, many current non-parametric approaches rely on unstable greedy maximization of approximate value f...
A Gauss-Newton Method for Markov Decision Processes
Approximate Newton methods are standard optimization tools which aim to maintain the benefits of Newton’s method, such as a fast rate of convergence, whilst alleviating its drawbacks, such as computationally expensive calculation or estimation of the inverse Hessian. In this work we investigate approximate Newton methods for policy optimization in Markov decision processes (MDPs). We first ana...
A Unifying Framework for Temporal Abstraction in Stochastic Processes
This paper presents a framework for unifying the large and growing body of literature that deals with what broadly can be defined as temporal abstraction in Markov Decision Processes (MDPs). MDPs provide an appealing formal framework for modeling a large variety of stochastic problems. The main drawback of this approach is that a requirement of the formal model, i.e., the Markov property, typic...
Policy search in kernel Hilbert space
Much recent work in reinforcement learning and stochastic optimal control has focused on algorithms that search directly through a space of policies rather than building approximate value functions. Policy search has numerous advantages: it does not rely on the Markov assumption, domain knowledge may be encoded in a policy, the policy may require less representational power than a value-functio...